In this project we will be working with the UCI adult dataset. We will attempt to predict which of two salary classes each person in the data set belongs to: making <=50k or >50k per year.
Typically, most of your time is spent cleaning data, not running the few lines of code that build your model. This project tries to reflect that by showing different issues that may arise when cleaning data.
Read in the adult_sal.csv file and set it to a data frame called adult.
Check the head of adult
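A minimal sketch of reading a CSV and checking its head, using a toy file because adult_sal.csv itself is not reproduced here. The toy values and the temporary file are assumptions for illustration. It also shows one common way a repeated index column can appear: a CSV written with row names gets an unnamed first column, which read.csv() names X.

```r
# Toy stand-in for adult_sal.csv (hypothetical values, not the real data)
toy <- data.frame(age = c(39, 50, 38),
                  income = c("<=50K", "<=50K", ">50K"))

# write.csv() stores row names by default, adding an unnamed first column
path <- tempfile(fileext = ".csv")
write.csv(toy, path)

# read.csv() gives that unnamed column the name "X"
adult_toy <- read.csv(path)
head(adult_toy)
```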
You should notice the index has been repeated. Drop this column.
library(dplyr)
adult <- select(adult,-X)
Check the head, str, and summary of the data now.
Notice that we have a lot of columns that are categorical factors; however, many of these columns have more levels than necessary. In this data cleaning section we'll try to clean these columns up by reducing the number of levels.
Use table() to check out the frequency of the type_employer column.
How many Null values are there for type_employer? What are the two smallest groups?
Combine these two smallest groups into a single group called "Unemployed". There are lots of ways to do this, so feel free to get creative. Hint: It may be helpful to convert these objects into character data types with as.character() and then use sapply() with a custom function.
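One possible way to follow the hint, sketched on a toy vector. The label names "Never-worked" and "Without-pay" and the helper name unemp are assumptions for illustration; substitute whichever two groups turn out to be smallest in your own table() output.

```r
# Toy employer column; the level names are assumed for illustration
type_employer <- factor(c("Private", "Never-worked", "Private",
                          "Without-pay", "State-gov"))

# Hypothetical helper: map the two rare labels to "Unemployed"
unemp <- function(job) {
  if (job %in% c("Never-worked", "Without-pay")) {
    "Unemployed"
  } else {
    job
  }
}

# Convert to character first, then apply the helper element by element
type_employer <- sapply(as.character(type_employer), unemp, USE.NAMES = FALSE)
table(type_employer)
```

On the real data the same call would be applied to adult$type_employer and the result assigned back to that column.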
What other columns are suitable for combining? Combine State and Local gov jobs into a category called SL-gov and combine self-employed jobs into a category called self-emp.
Use table() to look at the marital column
Reduce this to three groups:
Check the country column using table()
Group these countries together however you see fit. You have flexibility here because there is no right/wrong way to do this, possibly group by continents. You should be able to reduce the number of groups here significantly though.
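As a rough sketch of one grouping strategy, the lookup vectors below assign a handful of toy countries to continents. The country names, group labels, and the helper group_country are all assumptions for illustration; the real column has many more values, and anything not listed falls into "Other".

```r
# Toy country column (hypothetical values)
country <- c("United-States", "Mexico", "India", "Germany", "China", "France")

# Hypothetical lookup vectors; extend them to cover the countries in your data
asia <- c("India", "China")
europe <- c("Germany", "France")
north_america <- c("United-States", "Mexico")

# Hypothetical helper: return the group label for a single country
group_country <- function(ctry) {
  if (ctry %in% asia) {
    "Asia"
  } else if (ctry %in% europe) {
    "Europe"
  } else if (ctry %in% north_america) {
    "North.America"
  } else {
    "Other"
  }
}

region <- sapply(country, group_country, USE.NAMES = FALSE)
table(region)
```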
Use table() to confirm the groupings
Check the str() of adult again. Make sure any of the columns we changed have factor levels with factor()
We could still play around with education and occupation to try to reduce the number of factor levels in those columns, but let's go ahead and move on to dealing with the missing data. Feel free to group those columns as well and see how they affect your model.
#install.packages('Amelia',repos = 'http://cran.us.r-project.org')
library(Amelia)
Convert any cell with a '?' or a ' ?' value to a NA value. Hint: is.na() may be useful here or you can also use brackets with a conditional statement. Refer to the solutions if you can't figure this step out.
Using table() on a column with NA values will no longer count those values; instead you'll just see 0 for ?. Optional: Refactor these columns (may take a while). For example:
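A minimal sketch of the conversion and the optional refactoring on a toy data frame. The column values are assumptions for illustration, and the columns are built as factors on purpose so that the leftover "?" level is visible.

```r
# Toy data with "?" placeholders (hypothetical values)
toy <- data.frame(
  age = c(39, 50, 38, 53),
  type_employer = factor(c("Private", "?", "Private", "SL-gov")),
  occupation = factor(c("Sales", "?", "Tech-support", "Sales"))
)

# Brackets with a conditional: every cell equal to "?" becomes NA
toy[toy == "?"] <- NA

# The "?" level is still there, but now with a count of 0
table(toy$type_employer)

# Refactoring drops the unused level
toy$type_employer <- factor(toy$type_employer)
table(toy$type_employer)
```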
Play around with the missmap function from the Amelia package. Can you figure out what it's doing and how to use it?
You should have noticed that missmap(adult) is basically a heatmap pointing out missing values (NA). This gives you a quick glance at how much data is missing; in this case, not a whole lot (relatively speaking). You probably also noticed that there are a bunch of y labels. Get rid of them by running the command below. What is col=c('yellow','black') doing?
missmap(adult,y.at=c(1),y.labels = c(''),col=c('yellow','black'))
Use na.omit() to omit NA data from the adult data frame. Note, it really depends on the situation and your data to judge whether or not this is a good decision. You shouldn't always just drop NA values.
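A small sketch of what na.omit() does, on a toy data frame with assumed values: any row containing at least one NA is dropped, so it is worth counting the missing values per column first to see how much data you are about to lose.

```r
# Toy data with missing values (hypothetical values)
toy <- data.frame(age = c(39, NA, 38, 53),
                  occupation = c("Sales", "Tech-support", NA, "Sales"))

# Count missing values per column before dropping anything
colSums(is.na(toy))

# Drop every row that contains at least one NA
clean <- na.omit(toy)
nrow(clean)
```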
# May take awhile
#str(adult)
Use missmap() to check that all the NA values were in fact dropped.
Although we've cleaned the data, we still haven't explored it using visualization.
Check the str() of the data.
Use ggplot2 to create a histogram of ages, colored by income.
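One way such a histogram might look, sketched on randomly generated toy data (the values and the binwidth are assumptions, not taken from the adult data set). It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Randomly generated toy data (not the real adult data)
set.seed(101)
toy <- data.frame(
  age = sample(17:90, 200, replace = TRUE),
  income = factor(sample(c("<=50K", ">50K"), 200, replace = TRUE))
)

# Histogram of age, with bars filled by income class
p <- ggplot(toy, aes(x = age)) +
  geom_histogram(aes(fill = income), color = "black", binwidth = 1) +
  theme_bw()
print(p)
```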
Plot a histogram of hours worked per week
Rename the country column to region column to better reflect the factor levels.
Create a barplot of region with the fill color defined by income class. Optional: Figure out how to rotate the x axis text for readability.
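A hedged sketch of the rename and the bar plot on toy data with assumed values. The rename uses base R here, though dplyr's rename() would work as well, and the rotation is done through theme(); it assumes ggplot2 is installed.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(
  country = c("North.America", "Asia", "Europe", "North.America", "Other"),
  income = factor(c("<=50K", ">50K", "<=50K", ">50K", "<=50K"))
)

# Rename the country column to region
names(toy)[names(toy) == "country"] <- "region"

# Bar plot of region filled by income, with rotated x axis labels
p <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(fill = income), color = "black") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
```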
Now it's time to build a model to classify people into two groups: Above or Below 50k in Salary.
Refer to the Lecture or ISLR if you are fuzzy on any of this.
Logistic Regression is a type of classification model. In classification models, we attempt to predict the outcome of categorical dependent variables, using one or more independent variables. The independent variables can be either categorical or numerical.
Logistic regression is based on the logistic function, which always takes values between 0 and 1. Replacing the input of the logistic function with a linear combination of the independent variables we intend to use for regression, we arrive at the formula for logistic regression.
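To make this concrete, here is a small sketch of the logistic function applied to a linear combination of a single predictor. The coefficients b0 and b1 are made-up values chosen for illustration, not estimates from the adult data.

```r
# The logistic function maps any real number into the interval (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical intercept and slope for a single predictor (age)
b0 <- -3
b1 <- 0.05
age <- c(20, 60, 100)

# Linear combination of the predictor, passed through the logistic function
p <- logistic(b0 + b1 * age)
p
```

Base R's plogis() computes the same function, so logistic(2) and plogis(2) agree.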
Take a quick look at the head() of adult to make sure we have a good overview before going into building the model!
Split the data into a train and test set using the caTools library as done in previous lectures. Reference previous solutions notebooks if you need a refresher.
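A minimal sketch of a train/test split with caTools on toy data. The 0.7 split ratio, the seed, and the toy columns are assumptions for illustration, and the caTools package must be installed.

```r
library(caTools)

# Toy data with a balanced label (hypothetical values)
set.seed(101)
toy <- data.frame(
  age = 1:100,
  income = factor(rep(c("<=50K", ">50K"), each = 50))
)

# sample.split() returns a logical vector: TRUE marks rows for the training set
split_flag <- sample.split(toy$income, SplitRatio = 0.7)
train <- subset(toy, split_flag == TRUE)
test <- subset(toy, split_flag == FALSE)

nrow(train)
nrow(test)
```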
Explore the glm() function with help(glm). Read through the documentation.
help(glm)
Use all the features to train a glm() model on the training data set, pass the argument family=binomial(logit) into the glm function.
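A sketch of the glm() call on simulated training data. The two predictors, the column name hr_per_week, and the coefficients used to simulate the labels are all assumptions; the formula income ~ . means "use every other column as a predictor".

```r
# Simulated training data (not the real adult data)
set.seed(101)
n <- 200
train <- data.frame(age = runif(n, 17, 90),
                    hr_per_week = runif(n, 10, 60))

# Simulate labels from made-up coefficients so both classes appear
prob <- plogis(-4 + 0.04 * train$age + 0.04 * train$hr_per_week)
train$income <- factor(ifelse(runif(n) < prob, ">50K", "<=50K"),
                       levels = c("<=50K", ">50K"))

# Logistic regression on all remaining columns
model <- glm(income ~ ., family = binomial(logit), data = train)
summary(model)
```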
If you get a warning, this just means that the model may have fitted a probability of essentially 0% or 100% for some observations.
Check the model summary
We still have a lot of features! Some are important, some not so much. R comes with an awesome function called step(). The step() function iteratively tries to remove predictor variables from the model in an attempt to delete variables that do not significantly add to the fit. How does it do this? It uses AIC. Read the Wikipedia page for AIC if you want to understand this further; you can also check out help(step). This level of statistics is outside the scope of this project assignment, so let's keep moving along.
help(step)
Use new.model <- step(your.model.name) to use the step() function to create a new model.
You should get a bunch of messages informing you of the process. Check the new.model by using summary()
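A small sketch of step() on simulated data, with a pure noise column added so there is something for it to consider dropping. All column names and coefficients are made up for illustration, and trace = 0 is used only to silence the progress messages here.

```r
# Simulated data with one informative predictor and one noise column
set.seed(101)
n <- 300
train <- data.frame(age = runif(n, 17, 90),
                    noise = rnorm(n))
prob <- plogis(-3 + 0.06 * train$age)
train$income <- factor(ifelse(runif(n) < prob, ">50K", "<=50K"),
                       levels = c("<=50K", ">50K"))

model <- glm(income ~ ., family = binomial(logit), data = train)

# Backward selection by AIC; trace = 0 suppresses the step-by-step output
new.model <- step(model, trace = 0)
summary(new.model)

# Compare the AIC of the two models
AIC(model)
AIC(new.model)
```

Because step() only accepts a change when it lowers AIC, the selected model's AIC is never higher than the starting model's.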
You should have noticed that the step() function kept all the features used previously! While we used the AIC criterion to compare models, there are other tools we could have used. If you want, you can try reading about the variance inflation factor (VIF) and the vif() function to explore other options. In the meantime let's continue on and see how well our model performed against the test set.
Create a confusion matrix using the predict function with type='response' as an argument inside of that function.
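A sketch of building a confusion matrix from predicted probabilities, on simulated data. The data, the simple row-based train/test split, and the 0.5 cutoff are assumptions for illustration.

```r
# Simulated data (not the real adult data)
set.seed(101)
n <- 300
toy <- data.frame(age = runif(n, 17, 90))
toy$income <- factor(ifelse(runif(n) < plogis(-3 + 0.06 * toy$age),
                            ">50K", "<=50K"),
                     levels = c("<=50K", ">50K"))

# Simple row-based split, used here only to keep the sketch short
train <- toy[1:200, ]
test <- toy[201:300, ]

model <- glm(income ~ age, family = binomial(logit), data = train)

# type = "response" returns probabilities rather than log-odds
probs <- predict(model, newdata = test, type = "response")

# Rows are actual classes, columns are predictions at a 0.5 cutoff
cm <- table(actual = test$income, predicted = probs > 0.5)
cm
```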
You'll notice we have a rank deficient fit. Find out more about what issues this may cause by reading this stackexchange post.
What was the accuracy of our model?
Calculate other measures of performance, such as recall or precision.
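The arithmetic can be sketched with a made-up confusion matrix. The counts below are hypothetical and are not results from this model; ">50K" is treated as the positive class.

```r
# Hypothetical confusion matrix: rows are actual, columns are predicted
cm <- matrix(c(50, 5, 10, 35), nrow = 2,
             dimnames = list(actual = c("<=50K", ">50K"),
                             predicted = c("FALSE", "TRUE")))

tn <- cm["<=50K", "FALSE"]  # true negatives
fp <- cm["<=50K", "TRUE"]   # false positives
fn <- cm[">50K", "FALSE"]   # false negatives
tp <- cm[">50K", "TRUE"]    # true positives

accuracy <- (tp + tn) / sum(cm)   # share of all predictions that are correct
recall <- tp / (tp + fn)          # share of actual positives that were found
precision <- tp / (tp + fp)       # share of predicted positives that are right

c(accuracy = accuracy, recall = recall, precision = precision)
```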
In your opinion, how good was this model? What other context would you like to know before answering that question?
No right/wrong answers here; we just want you to think about accuracy, precision, and recall. You would like to know the costs associated with each kind of error.